Annotating Cognates and Etymological Origin in Turkic Languages

نویسندگان

  • Benjamin S. Mericli
  • Michael Bloodgood
چکیده

Turkic languages exhibit extensive and diverse etymological relationships among lexical items. These relationships make the Turkic languages promising for exploring automated translation lexicon induction by leveraging cognate and other etymological information. However, due to the extent and diversity of the types of relationships between words, it is not clear how to annotate such information. In this paper, we present a methodology for annotating cognates and etymological origin in Turkic languages. Our method strives to balance the amount of research effort the annotator expends with the utility of the annotations for supporting research on improving automated translation lexicon induction.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian

In this paper we propose a computational method for determining the orthographic similarity between Romanian and related languages. We account for etymons and cognates and we investigate not only the number of related words, but also their forms, quantifying orthographic similarities. The method we propose is adaptable to any language, as far as resources are available.

متن کامل

Automatic Construction of Weighted String Similarity Measures

String similarity metrics are used for several purposes in text-processing. One task is the extraction of cognates from bilingual text. In this paper three approaches to the automatic generation of language dependent string matching functions are presented. 1 I n t r o d u c t i o n String similarity metrics are extensively used in the processing of textual data for several purposes such as the...

متن کامل

The Genetic Legacy of the Expansion of Turkic-Speaking Nomads across Eurasia

The Turkic peoples represent a diverse collection of ethnic groups defined by the Turkic languages. These groups have dispersed across a vast area, including Siberia, Northwest China, Central Asia, East Europe, the Caucasus, Anatolia, the Middle East, and Afghanistan. The origin and early dispersal history of the Turkic peoples is disputed, with candidates for their ancient homeland ranging fro...

متن کامل

Etymological Wordnet: Tracing The History of Words

Research on the history of words has led to remarkable insights about language and also about the history of human civilization more generally. This paper presents the Etymological Wordnet, the first database that aims at making word origin information available as a large, machine-readable network of words in many languages. The information in this resource is obtained from Wiktionary. Extract...

متن کامل

String Similarity Measures and PAM-like Matrices for Cognate Identification

We present a new automatic learning system for the identification of cognates, words that derive from a common ancestor and share the same etymological origin. Our approach combines and adapts several techniques developed for biological sequence analysis to the natural language processing environment. We design a linguistic-inspired matrix to align sensibly our training dataset. We introduce a ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1501.03191  شماره 

صفحات  -

تاریخ انتشار 2014